DATS 6103 - Individual Project 2 - Lucia Illari (G24308643)

Wine is increasingly enjoyed by a wider and wider audience, and Portugal in particular comes in 9th among the top wine-exporting countries (source). Focusing on vinho verde, exports of this wine surpassed 50% of total sales in 2019, and France, Canada, the United Kingdom, and Japan are the biggest buyers (source). What makes this wine so popular? What physicochemical qualities contribute to its sensory quality? In essence, what makes these wines so good?
Thus, the purpose of this project is to delve into the physicochemical and sensory characteristics of the red and white variants of the Portuguese "Vinho Verde" wine. The end goal is to determine whether these characteristics can be used to predict the quality of the wine. Unfortunately, due to privacy and logistics issues, there is no data about grape types, wine brand, wine selling price, etc.
The original dataset can be found here. There are 1599 red wine instances and 4898 white wine instances, with 11 features and an output column called quality. This sensory preference column could theoretically range from 1 to 10, but in reality the values range from 3 to 9. The input variables are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
As red and white wine have different flavor profiles, it makes sense to keep the two datasets separate, rather than merging them. What makes a good white wine might make a terrible red wine. Further, several of the attributes may be correlated, so it would make sense to explore feature selection down the line.
First things first, let us import the necessary packages as well as the data:
#importing necessary packages
import numpy as np
import pandas as pd
import seaborn as sns
import tkinter as tk
from matplotlib.figure import Figure
from matplotlib.backends.backend_tkagg import (FigureCanvasTkAgg, NavigationToolbar2Tk)
import matplotlib.pyplot as plt
from tabulate import tabulate
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids
import plotly.express as px
from kneed import KneeLocator
from sklearn.metrics import silhouette_score
from scipy.signal import find_peaks
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
#set seed
random_seed = 42
%matplotlib notebook
#%matplotlib inline
#want to ignore warnings
import warnings
warnings.filterwarnings('ignore')
#importing in the datasets using pandas
#looking at the dataset we see the delimiter is actually a semicolon instead of a comma
red = pd.read_csv('winequality-red.csv', sep=';')
white = pd.read_csv('winequality-white.csv', sep=';')
white.head()
Let's check for any missing data:
#if the dataframe has an empty string, we want to replace it with NaN
white.replace('', np.nan, inplace=True)
red.replace('', np.nan, inplace=True)
#we want to print out the total number of NaNs in the dataframe
print("There are",white.isnull().sum().sum(),"NaN elements in the white wine dataset.")
print("There are",red.isnull().sum().sum(),"NaN elements in the red wine dataset.")
Great, there is no missing data in either of the datasets.
Now, the quality column is numeric, so I want to go ahead and create a categorical column that bins the values into quality groups. There are a lot of average wines (ratings of 5 and 6) in this dataset, and we want to understand what makes the very best stand out, so the cut-off will be 7 and up for a "good" wine, and 6 and below for an "okay" or "bad" wine, like so:
| category | values |
|---|---|
| bad | 1-6 |
| good | 7-10 |
We could use a more fine-grained rating, such as very poor, poor, average, good, excellent, or even just poor, average, good, but we want to get at the heart of what separates the good wines from all the average ones, especially since there are not very many good or poor wines in either dataset.
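The cut-off above can be sanity-checked with pandas' `pd.cut` on a toy series (the quality values below are made up for illustration, not the real data):

```python
import pandas as pd

# toy quality scores (hypothetical values for illustration)
toy = pd.DataFrame({'quality': [3, 5, 6, 7, 8, 9]})
# ratings of 6 and below fall in 'bad', 7 and up in 'good'
toy['good_bad'] = pd.cut(toy['quality'], bins=[0, 6, 10], labels=['bad', 'good'])
print(toy['good_bad'].tolist())  # ['bad', 'bad', 'bad', 'good', 'good', 'good']
```

The right-closed bins `(0, 6]` and `(6, 10]` reproduce the table exactly.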
print("For the white wine dataset, for each sensory rating, here are the counts:\n",white['quality'].value_counts())
print("\nFor the red wine dataset, for each sensory rating, here are the counts:\n",red['quality'].value_counts())
Both datasets contain 7 and 8 ratings, though only the white wine dataset has 9 ratings. This does mean, however, that our classes are severely imbalanced.
Regardless, let's go ahead and make our good/bad columns.
#create the new column with a vectorized comparison
#(avoids slow row-by-row chained assignment, which also raises SettingWithCopyWarning)
white['good_bad'] = np.where(white['quality'] <= 6, 'bad', 'good')
red['good_bad'] = np.where(red['quality'] <= 6, 'bad', 'good')
#want to move the good_bad column to be the first column
cols = white.columns.tolist()
cols = cols[-1:] + cols[:-1]
white = white[cols]
red = red[cols]
#want to view the data
white.head()
#grouping the data
gr_yt = white.iloc[:, 0:(len(cols)-1)].groupby(['good_bad'])
gr_red = red.iloc[:, 0:(len(cols)-1)].groupby(['good_bad'])
#we want to know the counts for each group
print("There are "+str(gr_yt.size()['bad'])+" bad white wines and "+str(gr_yt.size()['good'])+" good white wines, meaning "+str(round(gr_yt.size()['good']/len(white)*100,2))+"% of white wines are 'good'.")
print("There are "+str(gr_red.size()['bad'])+" bad red wines and "+str(gr_red.size()['good'])+" good red wines, meaning "+str(round(gr_red.size()['good']/len(red)*100,2))+"% of red wines are 'good'.")
#print out the mean and median values
print("Mean values for white wine:")
print(tabulate(gr_yt.mean(), headers='keys', tablefmt='fancy_grid'))
print("\n")
print("Mean values for red wine:")
print(tabulate(gr_red.mean(), headers='keys', tablefmt='fancy_grid'))
print("\n")
print("Median values for white wine:")
print(tabulate(gr_yt.median(), headers='keys', tablefmt='fancy_grid'))
print("\n")
print("Median values for red wine:")
print(tabulate(gr_red.median(), headers='keys', tablefmt='fancy_grid'))
There isn't an interesting difference for every attribute - for example, the density values for red and white wine are very similar regardless of whether the wine is good or bad. For total sulfur dioxide, however, the bad wines have higher values than the good wines for both colors, and from white to red there is almost a 100-point decrease in these values! Further, for both types of wine, the better wines have higher mean and median alcohol values - it looks like the better the wine, the more boozy it is... This gives us an idea of which attributes might be more or less important later on, and that they might differ between the two wine categories. Just looking at these values, we might exclude free sulfur dioxide from our models for white wine but include it for red wine.
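This kind of eyeballing can also be automated: subtract the per-group means and sort the differences. A minimal sketch on a toy frame (the numbers below are hypothetical stand-ins, not the actual wine data):

```python
import pandas as pd

# toy frame standing in for the wine data (hypothetical values)
toy = pd.DataFrame({'good_bad': ['bad', 'bad', 'good', 'good'],
                    'alcohol':  [9.5, 10.0, 11.5, 12.0],
                    'density':  [0.998, 0.997, 0.996, 0.995]})
means = toy.groupby('good_bad').mean()
# positive values: higher in good wines; negative: higher in bad wines
diff = (means.loc['good'] - means.loc['bad']).sort_values(ascending=False)
print(diff)
```

The same `groupby` pattern applied to `white` or `red` would rank which attributes separate the two classes most.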
But looking at this information just in tables is a little boring, so let's make an interactive GUI to view the descriptive statistics.
def start_gui():
    #the main window
    root = tk.Tk()
    #setting size
    root.geometry("1000x825")
    #not allowing resizing
    root.resizable(0, 0)
    #frames let me group and organize widgets
    frame1 = tk.Frame(root)
    frame2 = tk.Frame(root)
    frame3 = tk.Frame(root)
    frame4 = tk.Frame(root)
    #window title
    root.title('Descriptive statistics for vinho verde datasets')
    #generate some labels
    lbl1 = tk.Label(frame1, text = "Wine choice:")
    lbl1.pack()
    #functions are nested like this to keep track of which buttons were pressed
    def wine_choice(opt):
        #outputs descriptive statistics for the chosen column
        def describe(colm):
            if opt == 'white':
                res = white[colm].describe()
            else:
                res = red[colm].describe()
            #displaying statistics
            txt = "\nDescriptive statistics for {0} wine, {1}:\n\n{2}"
            lbl2 = tk.Label(frame3, text = txt.format(opt, colm, res))
            lbl2.pack()
            #nested inside describe so the histogram uses the currently selected attribute
            def b_plot():
                #figure that will contain the plot
                fig = Figure(figsize = (5, 5), dpi = 75)
                #setting the subplot grid parameters
                p1 = fig.add_subplot(211)
                p2 = fig.add_subplot(212)
                if opt == 'white':
                    grouping = white[['good_bad', colm]].groupby(['good_bad'])
                    dat_g = grouping.get_group('good')[colm]
                    dat_b = grouping.get_group('bad')[colm]
                    col = 'black'
                else:
                    grouping = red[['good_bad', colm]].groupby(['good_bad'])
                    dat_g = grouping.get_group('good')[colm]
                    dat_b = grouping.get_group('bad')[colm]
                    col = 'red'
                p1.hist(dat_g, weights=np.ones(len(dat_g)) / len(dat_g), color = col)
                p1.set_xlabel(colm)
                p1.set_ylabel('probability')
                title = 'Histogram of {0} for good {1} wine'
                p1.set_title(title.format(colm, opt))
                p2.hist(dat_b, weights=np.ones(len(dat_b)) / len(dat_b), color = col)
                p2.set_xlabel(colm)
                p2.set_ylabel('probability')
                title = 'Histogram of {0} for bad {1} wine'
                p2.set_title(title.format(colm, opt))
                fig.tight_layout()
                #creating the canvas containing the figure and placing it on the window
                canvas = FigureCanvasTkAgg(fig, root)
                canvas.draw()
                canvas.get_tk_widget().pack(side="top")
                btn_r = tk.Button(frame4, text="Restart", width=10, height=2)
                btn_r.pack(side="bottom")
                btn_r.bind('<Button-1>', lambda e: click('YES'))
                def click(yn):
                    btn_r.config(command=restart())
            #button for the plot
            btn_p = tk.Button(frame3, command = b_plot, width=10, height=3, text = "Histogram").pack()
        lbl3 = tk.Label(frame2, text = "Pick an attribute to investigate:")
        lbl3.pack()
        #spawn the attribute buttons after the user chooses a wine
        btn3 = tk.Button(frame2, text='fixed acidity', width=10, height=3)
        btn3.pack(side="left")
        #the bindings keep track of which button is pressed
        btn3.bind('<Button-1>', lambda e: describe('fixed acidity'))
        btn4 = tk.Button(frame2, text='volatile\nacidity', width=10, height=3)
        btn4.pack(side='left')
        btn4.bind('<Button-1>', lambda e: describe('volatile acidity'))
        btn5 = tk.Button(frame2, text='citric\nacid', width=10, height=3)
        btn5.pack(side='left')
        btn5.bind('<Button-1>', lambda e: describe('citric acid'))
        btn6 = tk.Button(frame2, text='residual\nsugar', width=10, height=3)
        btn6.pack(side='left')
        btn6.bind('<Button-1>', lambda e: describe('residual sugar'))
        btn7 = tk.Button(frame2, text='chlorides', width=10, height=3)
        btn7.pack(side='left')
        btn7.bind('<Button-1>', lambda e: describe('chlorides'))
        btn8 = tk.Button(frame2, text='free\nsulfur\ndioxide', width=10, height=3)
        btn8.pack(side='left')
        btn8.bind('<Button-1>', lambda e: describe('free sulfur dioxide'))
        btn9 = tk.Button(frame2, text='total\nsulfur\ndioxide', width=10, height=3)
        btn9.pack(side='left')
        btn9.bind('<Button-1>', lambda e: describe('total sulfur dioxide'))
        btn10 = tk.Button(frame2, text='density', width=10, height=3)
        btn10.pack(side='left')
        btn10.bind('<Button-1>', lambda e: describe('density'))
        btn11 = tk.Button(frame2, text='pH', width=10, height=3)
        btn11.pack(side='left')
        btn11.bind('<Button-1>', lambda e: describe('pH'))
        btn12 = tk.Button(frame2, text='sulphates', width=10, height=3)
        btn12.pack(side='left')
        btn12.bind('<Button-1>', lambda e: describe('sulphates'))
        btn13 = tk.Button(frame2, text='alcohol', width=10, height=3)
        btn13.pack(side='left')
        btn13.bind('<Button-1>', lambda e: describe('alcohol'))
        btn14 = tk.Button(frame2, text='quality', width=10, height=3)
        btn14.pack(side='left')
        btn14.bind('<Button-1>', lambda e: describe('quality'))
    #buttons for the wine choices
    btn1 = tk.Button(frame1, text = "white", width=10, height=2)
    btn1.pack(side='left')
    #remember which wine the user picks
    btn1.bind('<Button-1>', lambda e: wine_choice('white'))
    btn2 = tk.Button(frame1, text = "red", width=10, height=2)
    btn2.pack(side='left')
    btn2.bind('<Button-1>', lambda e: wine_choice('red'))
    def restart():
        root.destroy()
        start_gui()
    frame1.pack(padx=1, pady=1)
    frame2.pack(padx=1, pady=1)
    frame3.pack(padx=1, pady=1)
    frame4.pack(padx=1, pady=1)
    #must be called for the window to be drawn and events to be processed
    root.mainloop()
start_gui()
This is another way to visualize the information in the mean and median tables, essentially.
for i in cols[1:(len(cols)-1)]:
    fig, ax = plt.subplots(1,2, sharey = True, figsize=(15,5))
    p1 = sns.boxplot(x="good_bad", y=i, data=white, palette="cool", ax=ax[0], showmeans=True);
    p1.set_title("Box plot of "+i+" for white wine")
    p1.set(xlabel='quality of wine', ylabel=i)
    p2 = sns.boxplot(x="good_bad", y=i, data=red, palette="OrRd_r", ax=ax[1], showmeans=True);
    p2.set_title("Box plot of "+i+" for red wine")
    p2.set(xlabel='quality of wine', ylabel=i)
Now let's check which attributes are highly correlated; I am particularly interested to see whether any attributes are highly correlated with the quality of the wine.
corr_yt = white.corr(method='pearson')
corr_red = red.corr(method='pearson')
#print(tabulate(corr_yt, headers='keys', tablefmt='fancy_grid'))
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
hm1 = sns.heatmap(corr_yt, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=False, annot=True, ax=ax[0], linewidths=.5)
hm1.set_xticklabels(hm1.get_xticklabels(), rotation=45, horizontalalignment='right');
#strange cut-off of the top and bottom rows that requires manually setting the y-axis
hm1.set_ylim(len(corr_yt), -0.5);
hm2 = sns.heatmap(corr_red, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=False, ax=ax[1], annot=True, linewidths=.5)
hm2.set_xticklabels(hm2.get_xticklabels(), rotation=45, horizontalalignment='right');
hm2.set_ylim(len(corr_red), -0.5);
Looks like alcohol is moderately positively correlated with quality and very strongly negatively correlated with density, while density is weakly negatively correlated with quality and very strongly positively correlated with residual sugar.
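These relationships can be read off directly by sorting the quality column of the correlation matrix instead of scanning the heatmap. A sketch on a toy frame (hypothetical values; in the notebook this would be `white` or `red`):

```python
import pandas as pd

# toy frame with made-up values standing in for the wine data
toy = pd.DataFrame({'alcohol': [9.0, 10.0, 11.0, 12.0, 13.0],
                    'density': [1.000, 0.999, 0.997, 0.996, 0.994],
                    'quality': [5, 5, 6, 7, 7]})
# Pearson correlation of every feature with quality, strongest positive first
corr_with_quality = (toy.corr(method='pearson')['quality']
                        .drop('quality')
                        .sort_values(ascending=False))
print(corr_with_quality)
```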
Let's just quickly look at some scatterplots.
sns.pairplot(white.iloc[:, 0:12], diag_kind="kde", kind="hist", hue="good_bad");
Oftentimes the good and bad quality wines sit on top of each other, or the bad quality wines have a tail that breaks off from the pack. Using unsupervised machine learning to find clusters in this data might be difficult, and might prove more fruitful if we first reduce the dimensionality via PCA.
I'm interested in reducing the dimensionality of the dataset. By reducing the number of features, we speed up our algorithms and can improve their performance; decreasing the number of attributes should also reduce the noise. First, we need to standardize our data:
features = cols[1:13]
#separating out the features
#in this case I will keep the `quality` feature because this is unsupervised
x_yt = white.loc[:, features].values
x_red = red.loc[:, features].values
#separating out the target
y_yt = white.loc[:,['good_bad']].values
y_red = red.loc[:,['good_bad']].values
#standardizing the features
scal = StandardScaler()
x_yt_t = scal.fit_transform(x_yt)
x_red_t = scal.fit_transform(x_red)
Next we need to determine how many components we are keeping.
#fitting PCA (one object per dataset so the two fits don't overwrite each other)
pca = PCA()
pca_yt = PCA().fit(x_yt_t)
pca_red = PCA().fit(x_red_t)
#plotting the cumulative explained variance by number of components
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
l1 = sns.lineplot(data = pca_yt.explained_variance_ratio_.cumsum(), ax=ax[0], marker='o', color='black', linewidth=2.5);
l1.set_title("Explained variance by components for white wine");
l1.set(xlabel='Number of components', ylabel='Cumulative explained variance');
l2 = sns.lineplot(data = pca_red.explained_variance_ratio_.cumsum(), ax=ax[1], marker='o', color='red', linewidth=2.5);
l2.set_title("Explained variance by components for red wine");
l2.set(xlabel='Number of components', ylabel='Cumulative explained variance');
We want to preserve at least 80% of the variance so for both wines we will keep 6 components. Now we can perform PCA.
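The choice of 6 can also be computed rather than read off the plot: take the smallest component count whose cumulative explained variance reaches 80%. A sketch on synthetic correlated data (the data and the 80% threshold here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# synthetic correlated features standing in for the wine attributes
latent = rng.normal(size=(200, 4))
X = latent @ rng.normal(size=(4, 12)) + 0.3 * rng.normal(size=(200, 12))
cum = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_.cumsum()
# smallest number of components capturing at least 80% of the variance
n_keep = int(np.argmax(cum >= 0.80)) + 1
print(n_keep, round(cum[n_keep - 1], 3))
```

Applied to `x_yt_t` and `x_red_t`, the same `argmax` trick would confirm the 6-component choice.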
#keeping 6 components
pc_yt = PCA(n_components=6).fit_transform(x_yt_t)
pc_red = PCA(n_components=6).fit_transform(x_red_t)
pdf_yt = pd.DataFrame(data = pc_yt, columns = ['pc1', 'pc2','pc3', 'pc4','pc5','pc6'])
pdf_red = pd.DataFrame(data = pc_red, columns = ['pc1', 'pc2','pc3', 'pc4','pc5','pc6'])
fdf_yt = pd.concat([pdf_yt, white[['good_bad']]], axis = 1)
fdf_red = pd.concat([pdf_red, red[['good_bad']]], axis = 1)
Great, let's go ahead and investigate the effects of each feature on each component.
x_yt_s = scal.fit_transform(x_yt)
pca_6_yt = PCA(n_components=6)
pca_6_yt.fit(x_yt_s)
x_yt_pca=pca_6_yt.transform(x_yt_s)
x_r_s = scal.fit_transform(x_red)
pca_6_r = PCA(n_components=6)
pca_6_r.fit(x_r_s)
x_r_pca=pca_6_r.transform(x_r_s)
#plot of the effects of features on each component
#note: the PCA was fit on all 12 features (including quality), so the tick labels use cols[1:13]
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
hm3 = sns.heatmap(pca_6_yt.components_,
                  xticklabels=cols[1:13],
                  yticklabels=["PC "+str(x) for x in range(1,7)],
                  center=0,
                  cmap=sns.diverging_palette(20, 220),
                  square=False, ax=ax[0], linewidths=.5);
hm3.set_xticklabels(hm3.get_xticklabels(), rotation=45, horizontalalignment='right');
hm3.set_ylim(6, -0.5);
hm3.set_title("Effect of features on each component for white wine");
hm3.set(xlabel='Features', ylabel='Components');
hm4 = sns.heatmap(pca_6_r.components_,
                  xticklabels=cols[1:13],
                  yticklabels=["PC "+str(x) for x in range(1,7)],
                  center=0, cmap=sns.diverging_palette(20, 220),
                  square=False, ax=ax[1], linewidths=.5);
hm4.set_xticklabels(hm4.get_xticklabels(), rotation=45, horizontalalignment='right');
hm4.set_ylim(6, -0.5);
hm4.set_title("Effect of features on each component for red wine");
hm4.set(xlabel='Features', ylabel='Components');
I'd like to visualize some of the components in 3D before moving on to KMeans and KMedoids.
fig = px.scatter_3d(fdf_yt, x='pc1', y='pc2', z='pc3',color='good_bad')
fig.show()
fig = px.scatter_3d(fdf_yt, x='pc4', y='pc5', z='pc6',color='good_bad')
fig.show()
wcss_yt=[]
wcss_red=[]
for i in range(1,11):
    km_yt = KMeans(n_clusters=i, random_state=random_seed)
    km_red = KMeans(n_clusters=i, random_state=random_seed)
    km_yt.fit(pc_yt)
    km_red.fit(pc_red)
    wcss_yt.append(km_yt.inertia_)
    wcss_red.append(km_red.inertia_)
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
l3 = sns.lineplot(data = wcss_yt, ax=ax[0], marker='o', color='black', linewidth=2.5);
l3.set_title("KMeans with PCA clustering for white wine");
l3.set(xlabel='Number of clusters', ylabel='WCSS');
l4 = sns.lineplot(data = wcss_red, ax=ax[1], marker='o', color='red', linewidth=2.5);
l4.set_title("KMeans with PCA clustering for red wine");
l4.set(xlabel='Number of clusters', ylabel='WCSS');
kl_yt = KneeLocator(range(1, 11), wcss_yt, curve="convex", direction="decreasing")
kl_red = KneeLocator(range(1, 11), wcss_red, curve="convex", direction="decreasing")
print("Via Elbow Method optimal clusters for white wine is "+str(kl_yt.elbow))
print("Via Elbow Method optimal clusters for red wine is "+str(kl_red.elbow))
sc_yt=[]
sc_red=[]
for i in range(2,11):
    km_yt = KMeans(n_clusters=i, random_state=random_seed)
    km_red = KMeans(n_clusters=i, random_state=random_seed)
    km_yt.fit(pc_yt)
    km_red.fit(pc_red)
    sc_yt.append(silhouette_score(pc_yt, km_yt.labels_))
    sc_red.append(silhouette_score(pc_red, km_red.labels_))
fig, ax = plt.subplots(1,2, sharey = False, figsize=(20,8))
l5 = sns.lineplot(data = sc_yt, ax=ax[0], marker='o', color='black', linewidth=2.5);
l5.set_title("KMeans with PCA clustering for white wine");
l5.set(xlabel='Number of clusters', ylabel='Silhouette Score');
l6 = sns.lineplot(data = sc_red, ax=ax[1], marker='o', color='red', linewidth=2.5);
l6.set_title("KMeans with PCA clustering for red wine");
l6.set(xlabel='Number of clusters', ylabel='Silhouette Score');
p_yt, prop_yt = find_peaks(sc_yt, height=0)
#sc_yt starts at 2 clusters, so peak index i corresponds to i+2 clusters
print("Via Silhouette Scores optimal clusters for white wine is "+str(p_yt[0]+2))
p_red, prop_red = find_peaks(sc_red, height=0)
print("Via Silhouette Scores optimal clusters for red wine are "+str(p_red[0]+2)+" and "+str(p_red[1]+2))
Now we need to implement KMeans. I will try 3 and 4 clusters for the white wine data, and 2 and 5 clusters for red wine.
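The silhouette-based selection used above can be sanity-checked on synthetic blobs, where the true cluster count is known (the blob centers below are made up for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# three well-separated synthetic blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=0.5, random_state=42)
# silhouette score for each candidate cluster count
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).labels_)
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)  # the silhouette score peaks at the true cluster count, 3
```

On real, overlapping data such as the wine PCs, the peaks are far less pronounced, which is why the elbow and silhouette methods can disagree.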
#separate scaler and PCA objects per dataset, so the inverse transforms below use the correct fit
scal_y = StandardScaler()
scal_r = StandardScaler()
pca_y = PCA()
pca_r = PCA()
x_y = white.loc[:, features].values
x_r = red.loc[:, features].values
x_y_t = pca_y.fit_transform(scal_y.fit_transform(x_y))
x_r_t = pca_r.fit_transform(scal_r.fit_transform(x_r))
#employing the kmeans algorithm
k_y_3 = KMeans(n_clusters=3, random_state=random_seed)
k_y_3.fit(x_y_t)
#adding the cluster grouping to each instance
fdf_yt_c3 = pd.concat([fdf_yt, pd.DataFrame(data=k_y_3.labels_, columns=['cluster'])], axis = 1)
k_y_4 = KMeans(n_clusters=4, random_state=random_seed)
k_y_4.fit(x_y_t)
fdf_yt_c4 = pd.concat([fdf_yt, pd.DataFrame(data=k_y_4.labels_, columns=['cluster'])], axis = 1)
k_r_2 = KMeans(n_clusters=2, random_state=random_seed)
k_r_2.fit(x_r_t)
fdf_red_c2 = pd.concat([fdf_red, pd.DataFrame(data=k_r_2.labels_, columns=['cluster'])], axis = 1)
k_r_5 = KMeans(n_clusters=5, random_state=random_seed)
k_r_5.fit(x_r_t)
fdf_red_c5 = pd.concat([fdf_red, pd.DataFrame(data=k_r_5.labels_, columns=['cluster'])], axis = 1)
Now that we've done that, I want to visualize the clusters in a 3D plot, along with the means/centers in feature space.
print("KMeans with k=3 for white wine")
fig = px.scatter_3d(fdf_yt_c3, x='pc1', y='pc2', z='pc3', color='cluster')
fig.show()
m_3_y = pd.DataFrame(scal_y.inverse_transform(pca_y.inverse_transform(k_y_3.cluster_centers_)), columns=cols[1:13])
m_3_y['cluster'] = m_3_y.index
m_3_y
print("KMeans with k=4 for white wine")
fig = px.scatter_3d(fdf_yt_c4, x='pc1', y='pc2', z='pc3', color='cluster')
fig.show()
m_4_y = pd.DataFrame(scal_y.inverse_transform(pca_y.inverse_transform(k_y_4.cluster_centers_)), columns=cols[1:13])
m_4_y['cluster'] = m_4_y.index
m_4_y
print("KMeans with k=2 for red wine")
fig = px.scatter_3d(fdf_red_c2, x='pc1', y='pc2', z='pc3', color='cluster')
fig.show()
m_2_r = pd.DataFrame(scal_r.inverse_transform(pca_r.inverse_transform(k_r_2.cluster_centers_)), columns=cols[1:13])
m_2_r['cluster'] = m_2_r.index
m_2_r
print("KMeans with k=5 for red wine")
fig = px.scatter_3d(fdf_red_c5, x='pc1', y='pc2', z='pc3', color='cluster')
fig.show()
m_5_r = pd.DataFrame(scal_r.inverse_transform(pca_r.inverse_transform(k_r_5.cluster_centers_)), columns=cols[1:13])
m_5_r['cluster'] = m_5_r.index
m_5_r
Next I will fit 2 clusters each for white and red wine, this time using KMedoids, since the additional KMeans clusters did not feel very meaningful for this dataset. For KMeans the center does not have to be an actual point in the data, but for KMedoids it will be.
km_y_2 = KMedoids(n_clusters=2, random_state=random_seed)
km_y_2.fit(x_y_t)
fdf_y_c2 = pd.concat([fdf_yt, pd.DataFrame(data=km_y_2.labels_, columns=['cluster'])], axis = 1)
km_r_2 = KMedoids(n_clusters=2, random_state=random_seed)
km_r_2.fit(x_r_t)
fdf_r_c2 = pd.concat([fdf_red, pd.DataFrame(data=km_r_2.labels_, columns=['cluster'])], axis = 1)
print("KMedoids with k=2 for white wine")
fig = px.scatter_3d(fdf_y_c2, x='pc1', y='pc2', z='pc3', color='cluster')
fig.show()
me_2_y = pd.DataFrame(scal_y.inverse_transform(pca_y.inverse_transform(km_y_2.cluster_centers_)), columns=cols[1:13])
me_2_y['cluster'] = me_2_y.index
me_2_y
print("KMedoids with k=2 for red wine")
fig = px.scatter_3d(fdf_r_c2, x='pc1', y='pc2', z='pc3', color='cluster')
fig.show()
me_2_r = pd.DataFrame(scal_r.inverse_transform(pca_r.inverse_transform(km_r_2.cluster_centers_)), columns=cols[1:13])
me_2_r['cluster'] = me_2_r.index
me_2_r
When we look at these centers, some features definitely change between clusters: for red wine there is about a one-point difference in alcohol between clusters 0 and 1, nearly a one-point difference in fixed acidity, and a 16-point difference in total sulfur dioxide. Other features, such as free sulfur dioxide, do not change between clusters. These differences separate an average wine from a slightly above-average wine. This doesn't quite answer our question of which features and values make a great wine, but these are groupings that would have been difficult for a human to pick out.
Now we can move forward with classification.
The classic, most basic model when the response variable is categorical is logistic regression, and we employ it here as a baseline for the other supervised ML algorithms.
#need to remove `quality` from this data
features = cols[1:12]
x_yt = white.loc[:, features].values
#we need to split the data into train and test datasets
#I go for a 70-30 split here
#train to fit the model and test to evaluate it
x_white_train, x_white_test, y_white_train, y_white_test = train_test_split(x_yt, y_yt, train_size=0.7, random_state=random_seed)
#need to scale the data
x_white_train = scal.fit_transform(x_white_train)
#creating and training the model
model_lr_y = LogisticRegression(solver='saga',C=0.01,multi_class='auto',class_weight='balanced',random_state=random_seed).fit(x_white_train,y_white_train);
#evaluating the model
x_white_test = scal.transform(x_white_test)
y_white_pred = model_lr_y.predict(x_white_test)
print("For Logit Reg for white wine the training accuracy is "
+str(round(model_lr_y.score(x_white_train, y_white_train)*100,2))
+"% and the test accuracy is "
+str(round(model_lr_y.score(x_white_test, y_white_test)*100,2))+"%.")
#same with red wine data
x_red = red.loc[:, features].values
x_red_train, x_red_test, y_red_train, y_red_test = train_test_split(x_red, y_red, train_size=0.7, random_state=random_seed)
x_red_train = scal.fit_transform(x_red_train)
model_lr_r = LogisticRegression(solver='saga',multi_class='auto',class_weight='balanced',random_state=random_seed).fit(x_red_train,y_red_train);
x_red_test = scal.transform(x_red_test)
y_red_pred = model_lr_r.predict(x_red_test)
print("For Logit Reg for red wine the training accuracy is "
+str(round(model_lr_r.score(x_red_train, y_red_train)*100,2))
+"% and the test accuracy is "
+str(round(model_lr_r.score(x_red_test, y_red_test)*100,2))+"%.")
I wanted to check both the test and training accuracy to make sure the training accuracy wasn't much higher than the test accuracy, which would indicate that overfitting was occurring. That, however, is clearly not the case here. Let's check the confusion matrix for a closer look at the false/true positives and false/true negatives.
fig, (ax1, ax2) = plt.subplots(1, 2, sharey=False, figsize=(15,5))
plot_confusion_matrix(model_lr_y, x_white_test, y_white_test, ax=ax1);
ax1.set_title("Confusion matrix for logit reg, white wine");
plot_confusion_matrix(model_lr_r, x_red_test, y_red_test, ax=ax2);
ax2.set_title("Confusion matrix for logit reg, red wine");
print("For white wine:")
print(classification_report(y_white_test, y_white_pred))
print("\n\nFor red wine:")
print(classification_report(y_red_test, y_red_pred))
These are not bad values, all told. The recall scores are all above 0.5, reaching 0.87 for red wine, and the closer to 1, the better. However, the precision for good wines is dreadfully low, which means we are mostly classifying everything as a bad wine and not really picking out what makes a wine good with this algorithm. Let's take a look at how the features contribute to the model.
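Why precision collapses while recall looks fine is easy to see on a toy imbalanced example (the labels below are made up, not model output):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 90 bad wines, 10 good wines: a classifier that rarely predicts 'good'
y_true = np.array(['bad'] * 90 + ['good'] * 10)
y_pred = np.array(['bad'] * 88 + ['good'] * 2 + ['bad'] * 5 + ['good'] * 5)
# precision = TP / (TP + FP) = 5 / 7; recall = TP / (TP + FN) = 5 / 10
print(round(precision_score(y_true, y_pred, pos_label='good'), 3))
print(round(recall_score(y_true, y_pred, pos_label='good'), 3))
```

With so few 'good' wines, even a handful of false positives drags precision down, while overall accuracy stays high.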
print("Logit Reg coefficients for white wine model:")
pd.DataFrame(data={'features': features, 'coefficients': model_lr_y.coef_[0]})
print("Logit Reg coefficients for red wine model:")
pd.DataFrame(data={'features': features, 'coefficients': model_lr_r.coef_[0]})
For both, alcohol positively contributes to the quality of the wine, as suspected. For red wine, fixed acidity also positively contributes, whereas it doesn't at all for white wine. For both, volatile acidity negatively contributes, more so for red wine, and the same goes for density. And while sulphates contribute quite a bit to classifying a red wine as good, they only weakly do so for white wine. Thus, as we also suspected, what makes for a good red wine does not necessarily make a good white wine...
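Because the features were standardized before fitting, each coefficient is a log-odds change per one-standard-deviation increase, and exponentiating it gives an odds multiplier. A sketch with made-up coefficient values (not the fitted ones from the models above):

```python
import numpy as np

# hypothetical standardized-feature coefficients from a logistic regression
coefs = {'alcohol': 1.1, 'volatile acidity': -0.8}
# exp(coef) > 1 raises the odds of 'good' per one-SD increase; exp(coef) < 1 lowers them
odds = {feat: float(np.exp(c)) for feat, c in coefs.items()}
print(odds)
```

The same transformation applied to `model_lr_y.coef_[0]` or `model_lr_r.coef_[0]` would put the tables above on an interpretable odds scale.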
However, the precision for good wines with this logit reg models is not so great, so now we turn to Decision Trees to see if we can't do better.
features = cols[1:12]
x_yt = white.loc[:, features].values
x_white_train, x_white_test, y_white_train, y_white_test = train_test_split(x_yt, y_yt, train_size=0.7, random_state=random_seed)
decision_tree_yt = DecisionTreeClassifier(criterion='gini',splitter='random',random_state=random_seed,min_samples_leaf=10);
decision_tree_yt.fit(x_white_train, y_white_train);
acc_decision_tree_yt = decision_tree_yt.score(x_white_test, y_white_test)
print("Decision tree accuracy for white wine is "+str(round(acc_decision_tree_yt*100,3))+"%.")
x_red = red.loc[:, features].values
x_red_train, x_red_test, y_red_train, y_red_test = train_test_split(x_red, y_red, train_size=0.7, random_state=random_seed)
decision_tree_red = DecisionTreeClassifier(criterion='gini',splitter='random',random_state=random_seed);
decision_tree_red.fit(x_red_train, y_red_train);
acc_decision_tree_red = decision_tree_red.score(x_red_test, y_red_test)
print("Decision tree accuracy for red wine is "+str(round(acc_decision_tree_red*100,3))+"%.")
We have a problem, however: this tree is incredibly deep and very difficult to interpret. Oftentimes when the Gini criterion equals 0, there are only 1 or 4 samples in the node - how is that meaningful? Therefore, for something easier to interpret visually, we must limit the depth of the tree.
depth=3
m_split = 5
feat = 'sqrt'
decision_tree_y = DecisionTreeClassifier(criterion='gini',splitter='random',max_depth=depth,min_samples_leaf=m_split,max_features=feat,random_state=random_seed);
decision_tree_y.fit(x_white_train, y_white_train);
acc_decision_tree_y = decision_tree_y.score(x_white_test, y_white_test)
print("Decision tree with a max depth of "+str(depth)+" and minimum number of samples of "+str(m_split)+" accuracy for white wine is "+str(round(acc_decision_tree_y*100,3))+"%.")
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(decision_tree_y,
                   feature_names=features,
                   class_names=['bad','good'], #class labels in sorted order, matching decision_tree_y.classes_
                   filled=True, label='all', impurity=True)
fig.savefig("dt_white_depth3_samp5.png")
decision_tree_r = DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=depth, min_samples_leaf=m_split, max_features=feat, random_state=random_seed);
decision_tree_r.fit(x_red_train, y_red_train);
acc_decision_tree_r = decision_tree_r.score(x_red_test, y_red_test)
print("Decision tree with a max depth of "+str(depth)+" and minimum number of samples of "+str(m_split)+" accuracy for red wine is "+str(round(acc_decision_tree_r*100,3))+"%.")
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(decision_tree_r,
                   feature_names=features,
                   class_names=['bad','good'], #class labels in sorted order, matching decision_tree_r.classes_
                   filled=True, label='all', impurity=True)
fig.savefig("dt_red_depth3_samp5.png")
If we give up on making the decision trees human-readable and interpretable, their accuracy is technically better than that of logistic regression.
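Even a depth-limited tree still exposes which features drive its splits via `feature_importances_`. A sketch on synthetic data (the dataset and feature count below are made up; in the notebook one would inspect `decision_tree_y.feature_importances_` the same way):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic binary problem with 11 features, only 3 of them informative
X, y = make_classification(n_samples=500, n_features=11, n_informative=3,
                           random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
# importances sum to 1; the largest values mark the features the tree actually splits on
ranked = sorted(enumerate(clf.feature_importances_), key=lambda t: -t[1])
print(ranked[:3])
```

This gives a compact, interpretable summary even when the rendered tree itself is too deep to read.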